Learning Low-Density Separators 
Abstract. We define a novel, basic, unsupervised learning problem - learning the lowest density
homogeneous hyperplane separator of an unknown probability distribution. This task is relevant
to several problems in machine learning, such as semi-supervised learning and clustering stability.

We investigate the question of existence of a universally consistent algorithm for this problem. 
We propose two natural learning paradigms and prove that, on input unlabeled random samples 
generated by any member of a rich family of distributions, they are guaranteed to converge to 
the optimal separator for that distribution. We complement this result by showing that no learning
algorithm for our task can achieve uniform learning rates (that are independent of the data
generating distribution).

1 Introduction



While the theory of machine learning has achieved extensive understanding of many aspects of supervised 
learning, our theoretical understanding of unsupervised learning leaves a lot to be desired. In spite of the 
obvious practical importance of various unsupervised learning tasks, the state of our current knowledge 
does not provide anything that comes close to the rigorous mathematical performance guarantees that
classification prediction theory enjoys.

In this paper we make a small step in that direction by analyzing one specific unsupervised learning
task - the detection of low-density linear separators for data distributions over Euclidean spaces.

We consider the following task: for an unknown data distribution over $\mathbb{R}^d$, find the homogeneous
hyperplane of lowest density that cuts through that distribution. We assume that the underlying data
distribution has a continuous density function and that the data available to the learner are finite i.i.d. 
samples of that distribution. 

Our model can be viewed as a restricted instance of the fundamental issue of inferring information 
about a probability distribution from the random samples it generates. Tasks of that nature range from
the ambitious problem of density estimation [8], through estimation of level sets [4], [13], [1], densest 
region detection [3], and, of course, clustering. All of these tasks are notoriously difficult with respect 
to both the sample complexity and the computational complexity aspects (unless one presumes strong 
restrictions about the nature of the underlying data distribution). Our task seems more modest than 
these. Although we are not aware of any previous work on this problem (from the point of view of 
statistical machine learning, at least), we believe that it is a rather basic problem that is relevant to 
various practical learning scenarios. 

One important domain to which the detection of low-density linear data separators is relevant is 
semi-supervised learning [7]. Semi-supervised learning is motivated by the fact that in many real world
classification problems, unlabeled samples are much cheaper and easier to obtain than labeled examples. 
Consequently, there is great incentive to develop tools by which such unlabeled samples can be utilized 
to improve the quality of sample based classifiers. Naturally, the utility of unlabeled data to classification
depends on assuming some relationship between the unlabeled data distribution and the class membership
of data points (see [5] for a rigorous discussion of this point). A common postulate of that type
is that the boundary between data classes passes through low-density regions of the data distribution. 
The Transductive Support Vector Machines paradigm (TSVM) [9] is an example of an algorithm that 
implicitly uses such a low density boundary assumption. Roughly speaking, TSVM searches for a hyperplane
that has a small error on the labeled data and at the same time has a wide margin with respect to the
unlabeled data sample. 

Another area in which low-density boundaries play a significant role is the analysis of clustering 
stability. Recent work on the analysis of clustering stability found a close relationship between the stability
of a clustering and the data density along the cluster boundaries - roughly speaking, the lower these 
densities the more stable the clustering ([6], [12]). 

A low-density-cut algorithm for a family $\mathcal{F}$ of probability distributions takes as an input a finite
sample generated by some distribution $f \in \mathcal{F}$ and has to output a hyperplane through the origin
with low density w.r.t. $f$. In particular, we consider the family of all distributions over $\mathbb{R}^d$ that have
continuous density functions. We investigate two notions of success for low-density-cut algorithms -
uniform convergence (over a family of probability distributions) and consistency. For uniform convergence
we prove a general negative result, showing that no algorithm can guarantee any fixed convergence rates
(in terms of sample sizes). This negative result holds even in the simplest case where the data domain
is the one-dimensional unit interval. For consistency (i.e., allowing the learning/convergence rates to
depend on the data-generating distribution), we prove the success of two natural algorithmic paradigms:
Soft-Margin algorithms that choose a margin parameter (depending on the sample size) and output the
separator with lowest empirical weight in the margins around it, and Hard-Margin algorithms that choose
the separator with widest sample-free margins.

The paper is organized as follows: Section 2 provides the formal definition of our learning task as 
well as the success criteria that we investigate. In Section 3 we present two natural learning paradigms 
for the problem over the real line and prove their universal consistency over a rich class of probability 
distributions. Section 4 extends these results to show the learnability of lowest-density homogeneous 
linear cuts for probability distributions over $\mathbb{R}^d$ for arbitrary dimension, $d$. In Section 5 we show that
the previous universal consistency results cannot be improved to obtain uniform learning rates (by any 
finite-sample based algorithm). We conclude the paper with a discussion of directions for further research. 

2 Preliminaries

We consider probability distributions over $\mathbb{R}^d$. For concreteness, let the domain of the distribution be
the d-dimensional unit ball.

A linear cut learning algorithm is an algorithm that takes as input a finite set of domain points, a
sample $S \subseteq \mathbb{R}^d$, and outputs a homogeneous hyperplane, $L(S)$ (determined by a weight vector, $\mathbf{w} \in \mathbb{R}^d$,
such that $\|\mathbf{w}\|_2 = 1$).

We investigate algorithms that aim to detect hyperplanes with low density with respect to the sample- 
generating probability distribution. 

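Let $f : \mathbb{R}^d \to \mathbb{R}_0^+$ be a $d$-dimensional density function. We assume that $f$ is continuous. For any
homogeneous hyperplane $h(\mathbf{w}) = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{w}^T\mathbf{x} = 0\}$ defined by a unit weight vector $\mathbf{w} \in \mathbb{R}^d$, we
consider the $(d-1)$-dimensional integral of the density over $h(\mathbf{w})$,

$$\bar{f}(\mathbf{w}) := \int_{h(\mathbf{w})} f(\mathbf{x}) \, d\mathbf{x} .$$

Note that $\mathbf{w} \mapsto \bar{f}(\mathbf{w})$ is a continuous mapping defined on the $(d-1)$-sphere $S^{d-1} = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\|_2 = 1\}$.
Note that, for any such weight vector $\mathbf{w}$, $\bar{f}(\mathbf{w}) = \bar{f}(-\mathbf{w})$. For the 1-dimensional case, these hyperplanes
are replaced by points, $x$ on the real line, and $\bar{f}(x) = f(x)$ - the density at the point $x$.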



Definition 1. A linear cut learning algorithm is a function that maps samples to homogeneous hyperplanes.
Namely,

$$L : \bigcup_{m=1}^{\infty} (\mathbb{R}^d)^m \to S^{d-1} .$$

When $d = 1$, we require that

$$L : \bigcup_{m=1}^{\infty} \mathbb{R}^m \to [0, 1] .$$

(The intention is that $L$ finds the lowest density linear separator of the sample generating distribution.)

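Definition 2. Let $\mu$ be a probability distribution and $f$ its density function. For a weight vector $\mathbf{w}$ we
define the half-spaces $h^+(\mathbf{w}) = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{w}^T\mathbf{x} > 0\}$ and $h^-(\mathbf{w}) = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{w}^T\mathbf{x} < 0\}$. For any
weight vectors $\mathbf{w}$ and $\mathbf{w}'$,

1. $D_E(\mathbf{w}, \mathbf{w}') = 1 - |\mathbf{w}^T\mathbf{w}'|$

2. $D_\mu(\mathbf{w}, \mathbf{w}') = \min\{\mu(h^+(\mathbf{w}) \,\triangle\, h^+(\mathbf{w}')),\ \mu(h^-(\mathbf{w}) \,\triangle\, h^+(\mathbf{w}'))\}$

3. $D_f(\mathbf{w}, \mathbf{w}') = |\bar{f}(\mathbf{w}') - \bar{f}(\mathbf{w})|$

We shall mostly consider the distance measure $D_E$ in $\mathbb{R}^d$, for $d > 1$, and $D_E(x, y) = |x - y|$ for
$x, y \in \mathbb{R}$. In these cases we omit any explicit reference to $D$. All of our results hold as well when $D$ is
taken to be the probability mass of the symmetric difference between $L(S)$ and $\mathbf{w}^*$ and when $D$ is taken
to be $D(\mathbf{w}, \mathbf{w}') = |\bar{f}(\mathbf{w}) - \bar{f}(\mathbf{w}')|$.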

Definition 3. Let $\mathcal{F}$ denote a family of probability distributions over $\mathbb{R}^d$. We assume that all members
of $\mathcal{F}$ have density functions, and identify a distribution with its density function. Let $D$ denote a distance
function over hyperplanes. For a linear cut learning algorithm, $L$, as above,

1. We say that $L$ is consistent for $\mathcal{F}$ w.r.t. a distance measure $D$, if, for any probability distribution $f$
in $\mathcal{F}$, if $f$ attains a unique minimum density hyperplane then

$$\forall \epsilon > 0 \quad \lim_{m \to \infty} \Pr_{S \sim f^m}\left[ D(L(S), \mathbf{w}^*) > \epsilon \right] = 0 \qquad (1)$$

where $\mathbf{w}^*$ is the minimum density hyperplane for $f$.

2. We say that $L$ is uniformly convergent for $\mathcal{F}$ (w.r.t. a distance measure, $D$), if, for every $\epsilon, \delta > 0$,
there exists an $m(\epsilon, \delta)$ such that for any probability distribution $f \in \mathcal{F}$, if $f$ has a unique minimizer
$\mathbf{w}^*$ then, for all $m > m(\epsilon, \delta)$ we have

$$\Pr_{S \sim f^m}\left[ D(L(S), \mathbf{w}^*) > \epsilon \right] < \delta . \qquad (2)$$

3 The One Dimensional Problem 

Let $\mathcal{F}_1$ be the family of all probability distributions over the unit interval $[0, 1]$ that have a continuous
density function. We consider two natural algorithms for lowest density cut over this family. The first
is a simple bucketing algorithm. We explain it in detail and show its consistency in Section 3.1. The
second algorithm is the hard-margin algorithm, which outputs the mid-point of the largest gap between
two consecutive points of the sample. In Section 3.2 we show that the hard-margin algorithm is consistent.
In Section 5 we show that there are no algorithms that are uniformly convergent for $\mathcal{F}_1$.






3.1 The Bucketing Algorithm 



The algorithm is parameterized by a function $k : \mathbb{N} \to \mathbb{N}$. For a sample of size $m$, the algorithm splits
the interval $[0, 1]$ into $k(m)$ equal length subintervals (buckets). Given an input sample $S$, it counts the
number of sample points lying in each bucket and outputs the mid-point of the bucket with fewest sample
points. In case of ties, it picks the rightmost bucket. We denote this algorithm by $B_k$. As it turns out,
there exists a choice of $k(m)$ which makes the algorithm $B_k$ consistent for $\mathcal{F}_1$.
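The bucketing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation taken from the paper: the names `bucketing_cut` and `default_num_buckets` are hypothetical, and the growth rate of roughly $m^{1/3}$ buckets is just one assumed example of a function that tends to infinity while remaining $o(\sqrt{m})$.

```python
def bucketing_cut(sample, k):
    """Midpoint of the least-populated of k equal-width buckets of [0, 1].

    Ties are broken in favour of the rightmost bucket, as in the text.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    counts = [0] * k
    for x in sample:
        if not 0.0 <= x <= 1.0:
            raise ValueError("sample points must lie in [0, 1]")
        # The point x = 1.0 is assigned to the last bucket.
        idx = min(int(x * k), k - 1)
        counts[idx] += 1
    fewest = min(counts)
    # Rightmost bucket among those with the fewest sample points.
    best = max(i for i, c in enumerate(counts) if c == fewest)
    return (best + 0.5) / k


def default_num_buckets(m):
    """Hypothetical choice k(m) ~ m^(1/3): unbounded, yet o(sqrt(m))."""
    return max(1, int(round(m ** (1.0 / 3.0))))


if __name__ == "__main__":
    sample = [0.1, 0.15, 0.6, 0.65, 0.9]
    print(bucketing_cut(sample, 4))
```

In the small usage example, the second of the four buckets is the only empty one, so its midpoint is returned.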

Theorem 1. If the number of buckets $k(m) = o(\sqrt{m})$ and $k(m) \to \infty$ as $m \to \infty$, then the bucketing
algorithm $B_k$ is consistent for $\mathcal{F}_1$.

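Proof. Fix $f \in \mathcal{F}_1$ and assume $f$ has a unique minimizer $x^*$. Let $\mu$ denote the probability measure with
density $f$. Fix $\epsilon, \delta > 0$. Let $U = (x^* - \epsilon/2, x^* + \epsilon/2)$
be a neighbourhood of the unique minimizer $x^*$. The set $[0, 1] \setminus U$ is compact and hence there exists
$\alpha := \min f([0, 1] \setminus U)$. Since $x^*$ is the unique minimizer of $f$, $\alpha > f(x^*)$ and hence $\eta := \alpha - f(x^*)$ is
positive. Thus, we can pick a neighbourhood $V$ of $x^*$, $V \subset U$, such that for all $x \in V$, $f(x) < \alpha - \eta/2$.
The assumptions on growth of $k(m)$ imply that there exists $m_0$ such that for all $m > m_0$

$$\frac{1}{k(m)} < \frac{|V|}{2} \qquad (3)$$

$$2\sqrt{\frac{\ln(1/\delta)}{m}} < \frac{\eta}{2k(m)} \qquad (4)$$

Fix any $m > m_0$. Divide $[0, 1]$ into $k(m)$ buckets each of length $1/k(m)$. For any bucket $I$ with $I \cap U = \emptyset$,

$$\mu(I) \ge \frac{\alpha}{k(m)} . \qquad (5)$$

Since $1/k(m) < |V|/2$ there exists a bucket $J$ such that $J \subseteq V$. Furthermore,

$$\mu(J) \le \frac{\alpha - \eta/2}{k(m)} . \qquad (6)$$

For a bucket $I$, we denote by $|I \cap S|$ the number of sample points in the bucket $I$. From the well
known Vapnik-Chervonenkis bounds [2], we have that with probability at least $1 - \delta$ over i.i.d. draws of
sample $S$ of size $m$, for any bucket $I$,

$$\left| \frac{|I \cap S|}{m} - \mu(I) \right| < \sqrt{\frac{\ln(1/\delta)}{m}} . \qquad (7)$$

Fix any sample $S$ satisfying the inequality (7). For any bucket $I$ with $I \cap U = \emptyset$,

$$\begin{aligned}
\frac{|J \cap S|}{m} &< \mu(J) + \sqrt{\frac{\ln(1/\delta)}{m}} && \text{by (7)} \\
&\le \frac{\alpha - \eta/2}{k(m)} + \sqrt{\frac{\ln(1/\delta)}{m}} && \text{by (6)} \\
&< \frac{\alpha}{k(m)} - 2\sqrt{\frac{\ln(1/\delta)}{m}} + \sqrt{\frac{\ln(1/\delta)}{m}} && \text{by (4)} \\
&\le \mu(I) - \sqrt{\frac{\ln(1/\delta)}{m}} && \text{by (5)} \\
&< \frac{|I \cap S|}{m} && \text{by (7)}
\end{aligned}$$

Since $|I \cap S| > |J \cap S|$, the algorithm $B_k$ must not output the mid-point of any bucket $I$ for which
$I \cap U = \emptyset$. Hence, the algorithm's output, $B_k(S)$, is the mid-point of a bucket $I$ which intersects
$U$. Thus the estimate $B_k(S)$ differs from $x^*$ by at most the sum of the radius of the neighbourhood $U$
and the radius of the bucket. Since the length of a bucket is $1/k < |V|/2$ and $V \subset U$, the sum of the
radii is

$$|U|/2 + |V|/4 \le \tfrac{3}{4}|U| < \epsilon .$$

Combining all the above, we have that for any $\epsilon, \delta > 0$ there exists $m_0$ such that for any $m > m_0$,
with probability at least $1 - \delta$ over the draw of an i.i.d. sample $S$ of size $m$, $|B_k(S) - x^*| < \epsilon$. This is the
same as saying that $B_k$ is consistent for $f$. □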

Note that in the above proof we cannot replace the condition $k(m) = o(\sqrt{m})$ with $k(m) = O(\sqrt{m})$
since Vapnik-Chervonenkis bounds do not allow us to detect an $O(1/\sqrt{m})$-difference between probability
masses of two buckets.

The following theorem shows that if there are too many buckets the bucketing algorithm is not
consistent anymore.

Theorem 2. If the number of buckets $k(m) = \omega(m/\log m)$, then $B_k$ is not consistent for $\mathcal{F}_1$.

To prove the theorem we need the following lemma dealing with the classical coupon
collector problem.

Lemma 1 (The Coupon Collector Problem [11]). Let the random variable $X$ denote the number
of trials for collecting each of the $n$ types of coupons. Then for any constant $c \in \mathbb{R}$, and $m = n \ln n + cn$,

$$\lim_{n \to \infty} \Pr[X > m] = 1 - e^{-e^{-c}} .$$
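For small $n$ the probability in Lemma 1 can be computed exactly, which may help build intuition for how it is used below. The following sketch uses the standard inclusion-exclusion formula for the probability that $m$ independent uniform draws from $n$ coupon types hit every type; the function names are hypothetical and the code is an illustration, not part of the original argument.

```python
from fractions import Fraction
from math import comb


def prob_all_types_seen(n, m):
    """Exact probability that m uniform draws from n coupon types see every type.

    Inclusion-exclusion: sum_j (-1)^j * C(n, j) * ((n - j) / n)^m.
    """
    return sum(
        Fraction((-1) ** j * comb(n, j)) * Fraction(n - j, n) ** m
        for j in range(n + 1)
    )


def prob_some_type_missing(n, m):
    """Pr[X > m] in the notation of Lemma 1: at least one type is still missing."""
    return 1 - prob_all_types_seen(n, m)


if __name__ == "__main__":
    # With n buckets and m sample points, this is the chance of an empty bucket.
    print(float(prob_some_type_missing(20, 40)))
```

Exact rational arithmetic avoids the cancellation problems of the alternating sum, at the cost of being practical only for moderate $n$ and $m$.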

Proof (of Theorem 2). Consider the following density $f$ on $[0, 1]$,

$$f(x) = \begin{cases} (4 - 16x)/3 & \text{if } x \in [0, \tfrac{1}{4}] \\ (16x - 4)/3 & \text{if } x \in (\tfrac{1}{4}, \tfrac{1}{2}) \\ 4/3 & \text{if } x \in [\tfrac{1}{2}, 1] \end{cases}$$

which attains its unique minimum at $x^* = 1/4$.

From the assumption on the growth of $k(m)$, for all sufficiently large $m$, $k(m) > 4$ and $k(m) >
8m/\ln m$. Consider all the buckets lying in the interval $[\tfrac{1}{2}, 1]$ and denote them by $b_1, b_2, \ldots, b_n$. Since the
bucket size is less than 1/4, they cover the interval $[\tfrac{3}{4}, 1]$. Hence their total length is at least 1/4
and hence there are

$$n \ge k(m)/4 > 2m/\ln m$$

such buckets.

We will show that for $m$ large enough, with probability at least 1/2, at least one of the buckets
$b_1, b_2, \ldots, b_n$ receives no sample point. Since the probability masses of $b_1, b_2, \ldots, b_n$ are the same, we can
think of these buckets as coupon types we are collecting and the sample points as coupons. By Lemma 1
(applied with $c = 0$, where the limiting probability is $1 - e^{-1} > 1/2$),
it suffices to verify that the number of trials, $m$, is at most $n \ln n$. Indeed, we have

$$n \ln n \ge \frac{2m}{\ln m} \ln\left(\frac{2m}{\ln m}\right) = \frac{2m}{\ln m}\left(\ln m + \ln 2 - \ln\ln m\right) \ge m ,$$

where the last inequality holds for large enough $m$, since $\ln \ln m \le \tfrac{1}{2}\ln m$. Sample points falling outside
$b_1, b_2, \ldots, b_n$ only reduce the number of trials, so Lemma 1 implies that for sufficiently
large $m$, with probability at least 1/2, at least one of the buckets $b_1, b_2, \ldots, b_n$ contains no sample point.

If there are empty buckets in $[\tfrac{1}{2}, 1]$, the algorithm outputs a point in $[\tfrac{1}{2}, 1]$. Since this happens with
probability at least 1/2 and since $x^* = 1/4$, the algorithm cannot be consistent. □



When the number of buckets $k(m)$ is asymptotically somewhere in between $\sqrt{m}$ and $m/\ln m$, the
bucketing algorithm switches from being consistent to failing consistency. It remains an open question
to determine where exactly the transition occurs.



3.2 The Hard-Margin Algorithm 

Let the hard-margin algorithm be the function that outputs the mid-point of the largest interval between
adjacent sample points. More formally, given a sample $S$ of size $m$, the algorithm sorts the sample
$S \cup \{0, 1\}$ so that $x_0 = 0 \le x_1 \le x_2 \le \cdots \le x_m \le 1 = x_{m+1}$ and outputs the midpoint $(x_i + x_{i+1})/2$
where the index $i$, $0 \le i \le m$, is such that the gap $[x_i, x_{i+1}]$ is the largest.

Henceforth, the notion largest gap refers to the length of the largest interval between the adjacent
points of a sample.
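As an illustration, the hard-margin rule on $[0, 1]$ can be written directly from the description above. The function name `hard_margin_cut` is hypothetical, and the tie-breaking rule (the leftmost largest gap wins) is an assumption of this sketch, since the text does not specify one.

```python
def hard_margin_cut(sample):
    """Midpoint of the largest gap between adjacent points of sorted(S + {0, 1})."""
    for x in sample:
        if not 0.0 <= x <= 1.0:
            raise ValueError("sample points must lie in [0, 1]")
    points = [0.0] + sorted(sample) + [1.0]
    best_gap, best_mid = -1.0, None
    for left, right in zip(points, points[1:]):
        gap = right - left
        # Strict comparison: among equally large gaps the leftmost one is kept
        # (an assumption; the text leaves tie-breaking open).
        if gap > best_gap:
            best_gap = gap
            best_mid = (left + right) / 2
    return best_mid


if __name__ == "__main__":
    print(hard_margin_cut([0.125, 0.25, 0.75, 0.875]))
```

Sorting dominates the cost, so the rule runs in $O(m \log m)$ time for a sample of size $m$.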

Theorem 3. The hard-margin algorithm is consistent for the family $\mathcal{F}_1$.

To prove the theorem we need the following property of the distribution of the largest gap between
two adjacent elements of $m$ points forming an i.i.d. sample from the uniform distribution on $[0, 1]$.
The statement, for which we present a (to the best of our knowledge) new proof, was originally proven by
Levy [10].

Lemma 2. Let $L_m$ be the random variable denoting the largest gap between adjacent points of an i.i.d.
sample of size $m$ from the uniform distribution on $[0, 1]$. For any $\epsilon > 0$,

$$\lim_{m \to \infty} \Pr\left[ L_m \in \left( (1 - \epsilon)\frac{\ln m}{m},\ (1 + \epsilon)\frac{\ln m}{m} \right) \right] = 1 .$$



Proof (of Lemma). Consider the uniform distribution over the unit circle. Suppose we draw an i.i.d.
sample of size $m$ from this distribution. Let $K_m$ denote the size of the largest gap between two adjacent
samples. It is not hard to see that the distribution of $K_m$ is the same as that of $L_{m-1}$. Furthermore,
since $\frac{\ln(m+1)/(m+1)}{\ln m / m} \to 1$, we can thus prove the lemma with $L_m$ replaced by $K_m$.

Fix $\epsilon > 0$. First, let us show that for $m$ sufficiently large $K_m$ is with probability $1 - o(1)$ above the
lower bound $(1 - \epsilon)\frac{\ln m}{m}$. We split the unit circle into $b = \frac{m}{(1 - \epsilon)\ln m}$ buckets, each of length $(1 - \epsilon)\frac{\ln m}{m}$. It follows
from Lemma 1 that for any constant $\zeta > 0$ and an i.i.d. sample of $(1 - \zeta) b \ln b$ points at least one bucket
is empty with probability $1 - o(1)$. We show that for some $\zeta$, $m < (1 - \zeta) b \ln b$. Writing $1 + \delta := 1/(1 - \epsilon)$, the expression on the
right side can be rewritten as

$$\begin{aligned}
(1 - \zeta)\, b \ln b &= (1 - \zeta)(1 + \delta)\frac{m}{\ln m} \ln\left( (1 + \delta)\frac{m}{\ln m} \right) \\
&\ge m (1 - \zeta)(1 + \delta)\left( 1 - O\left( \frac{\ln \ln m}{\ln m} \right) \right) .
\end{aligned}$$

For $\zeta$ sufficiently small and $m$ sufficiently large the last expression is greater than $m$, yielding that a
sample of $m$ points misses at least one bucket with probability $1 - o(1)$. Therefore, the largest gap $K_m$
is with probability $1 - o(1)$ at least $(1 - \epsilon)\frac{\ln m}{m}$.

Next, we show that for $m$ sufficiently large, $K_m$ is with probability $1 - o(1)$ below the upper bound
$(1 + \epsilon)\frac{\ln m}{m}$. We consider $3/\epsilon$ bucketings $B_1, B_2, \ldots, B_{3/\epsilon}$. Each bucketing $B_i$, $i \in \{1, 2, \ldots, 3/\epsilon\}$, is
a division of the unit circle into $b = \frac{m}{(1 + \epsilon/3)\ln m}$ equal length buckets; each bucket has length $\ell =
(1 + \epsilon/3)\frac{\ln m}{m}$. The bucketing $B_i$ will have its left end-point of the first bucket at position $i(\ell\epsilon/3)$. The
position of the left end-point of the first bucket of a bucketing is called the offset of the bucketing.

We first show that there exists $\zeta > 0$ such that $m > (1 + \zeta) b \ln b$ for all sufficiently large $m$. Indeed,

$$\begin{aligned}
(1 + \zeta)\, b \ln b &= (1 + \zeta)\frac{m}{(1 + \epsilon/3)\ln m} \ln\left( \frac{m}{(1 + \epsilon/3)\ln m} \right) \\
&\le \frac{1 + \zeta}{1 + \epsilon/3}\, m \left( 1 - O\left( \frac{\ln \ln m}{\ln m} \right) \right) .
\end{aligned}$$

For any $\zeta < \epsilon/3$ and sufficiently large $m$ the last expression is smaller than $m$.

The existence of such $\zeta$ and Lemma 1 guarantee that for all sufficiently large $m$, for each bucketing
$B_i$, with probability $1 - o(1)$, each bucket is hit by a sample point. We now apply the union bound and get
that, for all sufficiently large $m$, with probability $1 - (3/\epsilon)o(1) = 1 - o(1)$, for each bucketing $B_i$, each
bucket is hit by at least one sample point. Consider any sample $S$ such that for each bucketing, each
bucket is hit by at least one point of $S$. Then, the largest gap in $S$ cannot be bigger than the bucket
size plus the difference of offsets between two adjacent bucketings, since otherwise the largest gap would
demonstrate an empty bucket in at least one of the bucketings. In other words, the largest gap, $K_m$, is
at most

$$K_m \le (\ell\epsilon/3) + \ell = (1 + \epsilon/3)\ell = (1 + \epsilon/3)^2 \frac{\ln m}{m} < (1 + \epsilon)\frac{\ln m}{m}$$

for any $\epsilon < 1$. □



Proof (of the Theorem). Consider any two disjoint intervals $U, V \subseteq [0, 1]$ such that for any $x \in U$ and
any $y \in V$, $f(x)/f(y) < p < 1$ for some $p \in (0, 1)$. We claim that with probability $1 - o(1)$, the largest gap in
$U$ is bigger than the largest gap in $V$.

If we draw an i.i.d. sample of $m$ points from $\mu$, according to the law of large numbers, for an arbitrarily
small $\chi > 0$, the ratio between the number of points $m_U$ in the interval $U$ and the number of points $m_V$
in the interval $V$ with probability $1 - o(1)$ satisfies

$$\frac{m_U}{m_V} \le (p + \chi)\frac{|U|}{|V|} . \qquad (8)$$

For a fixed $\chi$, choose a constant $\epsilon > 0$ such that $\frac{1 - \epsilon}{1 + \epsilon} > p + \chi$.

From Lemma 2 we show that with probability $1 - o(1)$ the largest gap between adjacent sample points
falling into $U$ is at least $(1 - \epsilon)|U| \frac{\ln m_U}{m_U}$. Similarly, with probability $1 - o(1)$ the largest gap between
adjacent sample points falling into $V$ is at most $(1 + \epsilon)|V| \frac{\ln m_V}{m_V}$. From (8) it follows that the ratio of gap
sizes with probability $1 - o(1)$ is at least

$$\begin{aligned}
\frac{(1 - \epsilon)|U| \frac{\ln m_U}{m_U}}{(1 + \epsilon)|V| \frac{\ln m_V}{m_V}}
&\ge \frac{1 - \epsilon}{1 + \epsilon} \cdot \frac{1}{p + \chi} \cdot \frac{\ln m_U}{\ln m_V}
\ge (1 + \gamma)\frac{\ln m_U}{\ln m_V} \\
&\ge (1 + \gamma)\frac{\ln m_V + O(1)}{\ln m_V} = (1 + \gamma)\left(1 + O(1)/\ln m_V\right) \to (1 + \gamma) \quad \text{as } m \to \infty
\end{aligned}$$

for a constant $\gamma > 0$ such that $1 + \gamma \le \frac{1 - \epsilon}{1 + \epsilon} \cdot \frac{1}{p + \chi}$. Hence for sufficiently large $m$ with probability $1 - o(1)$,
the largest gap in $U$ is strictly bigger than the largest gap in $V$.

Now, we can choose intervals $V_1, V_2$ such that $[0, 1] \setminus (V_1 \cup V_2)$ is an arbitrarily small neighbourhood
containing $x^*$. We can pick an even smaller neighbourhood $U$ containing $x^*$ such that for all $x \in U$ and
all $y \in V_1 \cup V_2$, $f(x)/f(y) < p < 1$ for some $p \in (0, 1)$. Then with probability $1 - o(1)$, the largest gap in $U$ is
bigger than the largest gap in $V_1$ and the largest gap in $V_2$. □



4 Learning Linear Cut Separators in High Dimensions



In this section we consider the problem of learning the minimum density homogeneous (i.e. passing
through origin) linear cut in distributions over $\mathbb{R}^d$. Namely, we assume that some unknown probability
distribution generates an i.i.d. finite sample of points in $\mathbb{R}^d$. We wish to process these samples to find the
$(d-1)$-dimensional hyperplane, through the origin of $\mathbb{R}^d$, that has the lowest probability density with
respect to the sample-generating distribution. In other words, we wish to find how to cut the space $\mathbb{R}^d$
through the origin in the "sparsest direction".

Formally, let $\mathcal{F}_d$ be the family of all probability distributions over $\mathbb{R}^d$ that have a continuous
density function. We wish to show that there exists a linear cut learning algorithm that is consistent for
$\mathcal{F}_d$. Note that by Theorem 5, no algorithm achieves uniform convergence for $\mathcal{F}_d$ (even for $d = 1$).

Define the soft-margin algorithm with parameter $\gamma : \mathbb{N} \to \mathbb{R}^+$ as follows. Given a sample $S$ of size $m$,
it counts, for every hyperplane, the number of sample points lying within distance $\gamma := \gamma(m)$ of it and outputs
the hyperplane with the lowest such count. In case of ties, it breaks them arbitrarily. We denote this
algorithm by $H_\gamma$. Formally, for any weight vector $\mathbf{w} \in S^{d-1}$ (the unit sphere in $\mathbb{R}^d$) we consider the
"$\gamma$-strip"

$$h(\mathbf{w}, \gamma) = \{\mathbf{x} \in \mathbb{R}^d : |\mathbf{w}^T\mathbf{x}| \le \gamma\}$$

and count the number of sample points lying in it. We output the weight vector $\mathbf{w}$ for which the number
of sample points in $h(\mathbf{w}, \gamma)$ is the smallest; we break ties arbitrarily.

To fully specify the algorithm, it remains to specify the function $\gamma(m)$. As it turns out, there is a
choice of the function $\gamma(m)$ which makes the algorithm consistent.
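To make the strip-counting rule concrete, here is a rough Python sketch for the planar case $d = 2$. It departs from the algorithm as defined in two ways that should be read as assumptions of the sketch: it searches only a finite grid of `num_directions` unit vectors instead of all of $S^{d-1}$, and it resolves ties by keeping the first minimizer found. The function name `soft_margin_cut_2d` is hypothetical.

```python
import math


def soft_margin_cut_2d(sample, gamma, num_directions=180):
    """Approximate soft-margin cut in the plane over a grid of unit vectors.

    Returns (w, count): the grid direction w whose gamma-strip
    {x : |w . x| <= gamma} contains the fewest sample points, and that count.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    best_w, best_count = None, None
    for j in range(num_directions):
        # Angles in [0, pi) suffice because w and -w define the same hyperplane.
        theta = math.pi * j / num_directions
        w = (math.cos(theta), math.sin(theta))
        count = sum(1 for (x, y) in sample if abs(w[0] * x + w[1] * y) <= gamma)
        if best_count is None or count < best_count:
            best_w, best_count = w, count
    return best_w, best_count


if __name__ == "__main__":
    points = [(0.1, 0.5), (-0.1, 0.5), (0.2, -0.5), (-0.2, -0.5)]
    w, count = soft_margin_cut_2d(points, gamma=0.15, num_directions=4)
    print(w, count)
```

Each candidate direction costs $O(m)$ to evaluate, so the grid search takes $O(m \cdot \texttt{num\_directions})$ time. An exact minimization over all directions would need a different search strategy, which this sketch does not attempt.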

Theorem 4. If $\gamma(m) = \omega(1/\sqrt{m})$ and $\gamma(m) \to 0$ as $m \to \infty$, then $H_\gamma$ is consistent for $\mathcal{F}_d$.

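Proof. The structure of the proof is similar to the proof of Theorem 1. However, we will need more
technical tools.

First let's fix $f$. For any weight vector $\mathbf{w} \in S^{d-1}$ and any $\gamma > 0$, we define $f_\gamma(\mathbf{w})$ as the $d$-dimensional
integral

$$f_\gamma(\mathbf{w}) := \int_{h(\mathbf{w}, \gamma)} f(\mathbf{x}) \, d\mathbf{x}$$

over the $\gamma$-strip along $\mathbf{w}$. Note that for any $\mathbf{w} \in S^{d-1}$,

$$\lim_{m \to \infty} \frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} = \bar{f}(\mathbf{w})$$

(assuming that $\gamma(m) \to 0$). In other words, the sequence of functions $\{f_{\gamma(m)}/\gamma(m)\}_{m=1}^{\infty}$, $f_{\gamma(m)}/\gamma(m) :
S^{d-1} \to \mathbb{R}_0^+$, converges point-wise to the function $\bar{f} : S^{d-1} \to \mathbb{R}_0^+$.

Note that $f_{\gamma(m)}/\gamma(m) : S^{d-1} \to \mathbb{R}_0^+$ is continuous for any $m$, and recall that $S^{d-1}$ is compact. Therefore
the sequence $\{f_{\gamma(m)}/\gamma(m)\}_{m=1}^{\infty}$ converges uniformly to $\bar{f}$. In other words, for every $\zeta > 0$ there exists
$m_0$ such that for any $m > m_0$ and any $\mathbf{w} \in S^{d-1}$,

$$\left| \frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} - \bar{f}(\mathbf{w}) \right| < \zeta .$$

Fix $f$ and $\epsilon, \delta > 0$. Let $U = \{\mathbf{w} \in S^{d-1} : |\mathbf{w}^T\mathbf{w}^*| > 1 - \epsilon\}$ be the "$\epsilon$-double-neighbourhood" of the
antipodal pair $\{\mathbf{w}^*, -\mathbf{w}^*\}$. The set $S^{d-1} \setminus U$ is compact and hence $\alpha := \min \bar{f}(S^{d-1} \setminus U)$ exists. Since
$\mathbf{w}^*, -\mathbf{w}^*$ are the only minimizers of $\bar{f}$, $\alpha > \bar{f}(\mathbf{w}^*)$ and hence $\eta := \alpha - \bar{f}(\mathbf{w}^*)$ is positive.

The assumptions on $\gamma(m)$ imply that there exists $m_0$ such that for all $m > m_0$,

$$\sqrt{\frac{d + \ln(1/\delta)}{m}} < \frac{\eta}{6}\gamma(m) \qquad (9)$$

$$\left| \frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} - \bar{f}(\mathbf{w}) \right| < \eta/3 \quad \text{for all } \mathbf{w} \in S^{d-1} \qquad (10)$$

Fix any $m > m_0$. For any $\mathbf{w} \in S^{d-1} \setminus U$, we have

$$\begin{aligned}
\frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} &> \bar{f}(\mathbf{w}) - \eta/3 && \text{by (10)} \\
&\ge \bar{f}(\mathbf{w}^*) + \eta - \eta/3 && \text{by choice of } \eta \text{ and } U \\
&= \bar{f}(\mathbf{w}^*) + 2\eta/3 \\
&> \frac{f_{\gamma(m)}(\mathbf{w}^*)}{\gamma(m)} - \eta/3 + 2\eta/3 && \text{by (10)} \\
&= \frac{f_{\gamma(m)}(\mathbf{w}^*)}{\gamma(m)} + \eta/3 .
\end{aligned}$$

From the above chain of inequalities, after multiplying by $\gamma(m)$, we have

$$f_{\gamma(m)}(\mathbf{w}) > f_{\gamma(m)}(\mathbf{w}^*) + \eta\gamma(m)/3 . \qquad (11)$$

From the well known Vapnik-Chervonenkis bounds [2], we have that with probability at least $1 - \delta$
over i.i.d. draws of $S$ of size $m$ we have that for any $\mathbf{w}$,

$$\left| \frac{|h(\mathbf{w}, \gamma) \cap S|}{m} - f_{\gamma(m)}(\mathbf{w}) \right| < \sqrt{\frac{d + \ln(1/\delta)}{m}} , \qquad (12)$$

where $|h(\mathbf{w}, \gamma) \cap S|$ denotes the number of sample points lying in the $\gamma$-strip $h(\mathbf{w}, \gamma)$.
Fix any sample $S$ satisfying the inequality (12). We have, for any $\mathbf{w} \in S^{d-1} \setminus U$,

$$\begin{aligned}
\frac{|h(\mathbf{w}, \gamma) \cap S|}{m} &> f_{\gamma(m)}(\mathbf{w}) - \sqrt{\frac{d + \ln(1/\delta)}{m}} && \text{by (12)} \\
&> f_{\gamma(m)}(\mathbf{w}^*) + \eta\gamma(m)/3 - \sqrt{\frac{d + \ln(1/\delta)}{m}} && \text{by (11)} \\
&> \frac{|h(\mathbf{w}^*, \gamma) \cap S|}{m} - \sqrt{\frac{d + \ln(1/\delta)}{m}} + \eta\gamma(m)/3 - \sqrt{\frac{d + \ln(1/\delta)}{m}} && \text{by (12)} \\
&> \frac{|h(\mathbf{w}^*, \gamma) \cap S|}{m} && \text{by (9)}
\end{aligned}$$

Since $|h(\mathbf{w}, \gamma) \cap S| > |h(\mathbf{w}^*, \gamma) \cap S|$, the algorithm must not output a weight vector $\mathbf{w}$ lying in $S^{d-1} \setminus U$.
In other words, the algorithm's output, $H_\gamma(S)$, lies in $U$, i.e. $|H_\gamma(S)^T\mathbf{w}^*| > 1 - \epsilon$.

We have proven that for any $\epsilon, \delta > 0$, there exists $m_0$ such that for all $m > m_0$, if a sample $S$ is
drawn i.i.d. from $f$, then with probability at least $1 - \delta$, $|H_\gamma(S)^T\mathbf{w}^*| > 1 - \epsilon$. In other words, $H_\gamma$ is consistent for $f$. □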



5 The impossibility of Uniform Convergence 



In this section we show a negative result that roughly says one cannot hope for an algorithm that can
achieve $\epsilon$ accuracy and $1 - \delta$ confidence for sample sizes that only depend on these parameters and not
on properties of the probability measure.

Theorem 5. No linear cut learning algorithm is uniformly convergent for $\mathcal{F}_1$ with respect to any of the
distance functions $D_E$, $D_f$ and $D_\mu$.

Proof. For a fixed $\delta > 0$ we show that for any $m \in \mathbb{N}$ there are distributions with density functions $f$ and
$g$ such that no algorithm using a random sample of size at most $m$ drawn from one of the distributions
chosen uniformly at random, can identify the distribution with probability of error less than 1/2 with
probability at least $\delta$ over random choices of a sample.

Since for any $\delta$ and $m$ we find densities $f$ and $g$ such that with probability more than $(1 - \delta)$ the
output of the algorithm is bounded away by 1/4 from either 1/4 or 3/4, for the family $\mathcal{F}_1$ no algorithm
converges uniformly w.r.t. any distance measure.

Consider two piecewise linear density functions $f$ and $g$ defined on $[0, 1]$ such that for some $n$, $f$ is linear
on the intervals $[0, \tfrac{1}{4} - \tfrac{1}{2n}]$, $[\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4}]$, $[\tfrac{1}{4}, \tfrac{1}{4} + \tfrac{1}{2n}]$, and $[\tfrac{1}{4} + \tfrac{1}{2n}, 1]$, is constant on the first and the last of them,
and $g$ is the reflection of $f$ w.r.t. the centre of the unit interval, i.e. $f(x) = g(1 - x)$. The functions $f$
and $g$ can be simply described as constant functions everywhere except for a thin V-shape around 1/4, resp.
3/4, with the bottom of the V at 1/4, resp. 3/4. For any $x \notin [\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4} + \tfrac{1}{2n}] \cup [\tfrac{3}{4} - \tfrac{1}{2n}, \tfrac{3}{4} + \tfrac{1}{2n}]$, $f(x) = g(x)$.




Fig. 1. $f$ is uniform everywhere except a small neighbourhood around 1/4 where it has a sharp 'v' shape, and
$g$ is the reflection of $f$ about $x = 1/2$.



Let us lower-bound the probability that a sample of size $m$ drawn from $f$ misses the set $U \cup V$ for
$U := [\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4} + \tfrac{1}{2n}]$ and $V := [\tfrac{3}{4} - \tfrac{1}{2n}, \tfrac{3}{4} + \tfrac{1}{2n}]$. For any $x \in U$ and $y \notin U$, $f(x) \le f(y)$, and furthermore,
$f$ is constant on the set $[0, 1] \setminus U$, which contains at most the entire probability mass 1. Therefore, for $p_f(Z)$
denoting the probability that a point drawn from the distribution with the density $f$ hits the set $Z$, we
have $p_f(U) \le p_f(V) \le \frac{1}{n-1}$, yielding that $p_f(U \cup V) \le \frac{2}{n-1}$. Hence, an i.i.d. sample of size $m$ misses
$U \cup V$ with probability at least $(1 - 2/(n-1))^m \ge (1 - \eta)e^{-2m/n}$ for any constant $\eta > 0$ and $n$ sufficiently
large. For a proper $\eta$ and $n$ sufficiently large we get $(1 - \eta)e^{-2m/n} > 1 - \delta$. From the symmetry between
$f$ and $g$, a random sample of size $m$ drawn from $g$ misses $U \cup V$ with the same probability.

We have shown that for any $\delta > 0$, $m \in \mathbb{N}$, and for $n$ sufficiently large, regardless of which of the two
distributions the sample is drawn from, it does not intersect $U \cup V$ with probability more
than $1 - \delta$. Since in $[0, 1] \setminus (U \cup V)$ both density functions are equal, the probability of error in the
discrimination between $f$ and $g$, conditioned on the sample not intersecting $U \cup V$, cannot be less
than 1/2. □



6 Conclusions and open questions 

In this paper we have presented a novel unsupervised learning problem that is modest enough to allow learning
algorithms with asymptotic learning guarantees, while being relevant to several central challenging
learning tasks. Our analysis can be viewed as providing justification to some common semi-supervised 
learning paradigms, such as the maximization of margins over the unlabeled sample or the search for 
empirically-sparse separating hyperplanes. As far as we know, our results provide the first performance 
guarantees for these paradigms. 

From a more general perspective, the paper demonstrates some type of meaningful information about 
a data generating probability distribution that can be reliably learned from finite random samples of 
that distribution, in a fully non-parametric model - without postulating any prior assumptions about 
the structure of the data distribution. As such, the search for a low-density data separating hyperplane
can be viewed as a basic tool for the initial analysis of unknown data, an analysis that can be carried out
in situations where the learner has no prior knowledge about the data in question and can only access it
via unsupervised random sampling.

Our analysis raises some intriguing open questions. First, note that while we prove the universal 
consistency of the 'hard-margin' algorithm for data distributions over the real line, we do not have a similar result
for higher dimensional data. Since searching for empirical maximal margins is a common heuristic, it is 
interesting to resolve the question of consistency of such algorithms. 

Another natural research direction that this work calls for is the extension of our results to more 
complex separators. In clustering, for example, it is common to search for clusters that are separated 
by sparse data regions. However, such between-cluster boundaries are often not linear. Can one provide
any reliable algorithm for the detection of sparse boundaries from finite random samples when these 
boundaries belong to a richer family of functions? 

Our research has focused on the information complexity of the task. However, to evaluate the practical
usefulness of our proposed algorithms, one should also carry out a computational complexity analysis
of the low-density separation task. We conjecture that the problem of finding the homogeneous hyperplane
with largest margins, or lowest density around it (with respect to a finite high dimensional set of
points) is NP-hard (when the Euclidean dimension is considered as part of the input, rather than as a
fixed constant parameter). However, even if this conjecture is true, it will be interesting to find efficient
approximation algorithms for these problems.